Abstract


Generating natural language descriptions for in-the-wild videos
is a challenging task. Most state-of-the-art methods for solving
this problem borrow existing deep convolutional neural network
(CNN) architectures (AlexNet, GoogLeNet) to extract a visual
representation of the input video. However, these deep CNN
architectures are designed for single-label, center-positioned
object classification. While they generate strong semantic
features, they have no inherent structure allowing them to detect
multiple objects of different sizes and locations in the frame.
Our paper tries to solve this problem by integrating the base CNN
into several fully convolutional neural networks (FCNs) to form a
multi-scale network that handles multiple receptive field sizes in
the original image. FCNs, previously applied to image
segmentation, can generate class heat-maps efficiently compared to
sliding window mechanisms, and can easily handle multiple scales.
To further handle the ambiguity over multiple objects and
locations, we incorporate the Multiple Instance Learning (MIL)
mechanism to consider objects in different positions and at
different scales simultaneously. We integrate our multi-scale
multi-instance architecture with a sequence-to-sequence recurrent
neural network to generate sentence descriptions based on the
visual representation. Ours is the first end-to-end trainable
architecture that is capable of multi-scale region processing.
Evaluation on a YouTube video dataset shows the advantage of our
approach compared to the original single-scale whole-frame CNN
model. Our flexible and efficient architecture can potentially be
extended to support other video processing tasks.


A cat and a bear meet in the woods.



Figure 1. To generate an accurate and detailed description of a
video, the visual representation must capture multiple objects in
the frame, each of which may have a different size. This paper
proposes a convolutional neural network integration architecture
for video description that simultaneously searches over multiple
locations and receptive field sizes.

1. Introduction 

The ability to automatically describe videos in natural language
has many real-world applications. For example, it could be used to
create one-sentence summaries of short video clips in a user's
collection for an easier browsing experience. Other applications
include content-based video retrieval, descriptive video service
(DVS) for the visually impaired, and automated video surveillance.

Most current approaches to this problem make use of pre-trained
deep convolutional neural networks (CNNs) as semantic feature
extractors for each video frame. These CNN models (e.g. AlexNet
[11], GoogLeNet [21], VGG [20]) are trained to predict a single
object label on images where objects are usually center-positioned
and occupy most of the image. However, realistic videos are much
more complex and contain several objects of different scales in
different positions of each video frame, including small objects.
As Figure 1 illustrates, applying such a CNN to the full frame
(top) detects only some of the semantic categories, in this case
only woods. To detect smaller objects and actions, receptive
fields of different sizes (relative to the original image size)
must be used.

Region processing using a CNN detection model [5] has been
proposed for image-to-text description [10, 4]; however, applying
this model to video would incur considerable computational cost.
Furthermore, the region proposal step would prevent end-to-end
training of the network. A major advantage of a fully
convolutional network (FCN) is that it can efficiently generate a
spatial score map for each class, instead of a single score for
each class as in the classification CNN. Each score in the output
score map corresponds to one receptive field in the input image.
By up-scaling the input image, an FCN can generate larger score
maps whose receptive fields are smaller relative to the original
image. Thus, incorporating an FCN can capture semantic concepts at
different scales and locations in the original input frame. An FCN
is also more efficient than both region proposal methods and the
sliding window mechanism.

In this paper, we propose the first end-to-end trainable video
description network that incorporates spatially localized
descriptors to capture concepts at multiple scales. We combine the
traditional classification CNN that operates on the scale of the
whole image with several smaller receptive fields using an FCN. We
further incorporate a Multiple Instance Learning (MIL) mechanism
to deal with the uncertainty of object scales and positions. The
resulting semantic representation of the frames is encoded into a
hidden state vector and then decoded into a sentence using a
variant of a recurrent neural network proposed in [23]. We call
our model the Multi-scale Multi-instance Video Description Network
(MM-VDN).

After generating the spatial score maps for the semantic concepts,
our network still has to cope with the uncertainty over the
number, location, and scale of the detected concepts. Traditional
approaches for object detection use training data annotated with
the location and size of each object. Such training data are not
generally available for videos as they are very time-consuming to
collect. Therefore, we propose to use the weak supervision
available in the form of sentences to train the network. We
incorporate an MIL mechanism, which treats the score maps as bags
of examples, each corresponding to some region in the image. An
MIL mechanism is applied separately at each scale to select the
most likely location, and also applied to select the most likely
scale.

Pre-training deep representations on a large classification
dataset like ImageNet has proven to be a powerful initialization
for many computer vision tasks. We therefore use a pre-trained
version of AlexNet, converting it to a fully convolutional
multi-scale network. We note that our approach is general and can
be applied to other popular CNNs, such as GoogLeNet or VGG. We
present a detailed evaluation on a large corpus of YouTube videos
[2]. We evaluate different input scales and training regimes, and
show improved results compared to previously proposed
architectures.

In the next section, we review related work on multi-scale region
processing and video description generation. Then, we describe the
design of our MM-VDN network architecture. Finally, we present
experimental results and discussion.

2. Related work 

Early video description generation methods explicitly constructed
a CRF semantic role representation, or Subject-Verb-Object triple,
for each video, and used a template model to generate a sentence
[19, 22]. Recently, CNNs combined with recurrent neural networks
have been applied to the related still-image description task,
achieving good results [25, 15]. They used the Long Short Term
Memory recurrent network (LSTM [7]), which incorporates explicitly
controllable memory units that allow it to learn long-range
temporal dependencies that are very difficult to learn using
traditional recurrent networks.

The hybrid CNN-LSTM recurrent neural network architecture has also
recently been applied to video description generation [24, 27].
The method of [24] first extracts a deep representation of each
frame in a video using AlexNet, mean-pools the resulting vectors
to produce a vector representation of a complete short video, and
then trains an LSTM to decode this vector into a sequence of words
in order to produce a descriptive sentence. However, both of these
papers directly use a standard ImageNet-trained CNN model (AlexNet
and GoogLeNet) as a feature extractor, without considering the
differences between ImageNet and the video domain, such as the
object and activity scale, position, and the presence of multiple
labels in each frame. We improve on this approach by proposing a
multi-scale multi-instance FCN integration model to represent each
frame, combined with a sequence-to-sequence variant of LSTM [23]
to enable sequential frame processing.

Before deep learning features gained popularity, researchers
proposed spatial pyramid pooling [12], ObjectBank [13], and other
mechanisms to deal with the object scale and position problem for
handcrafted features (SIFT, HOG, etc.). More recently, [16]
attempted to solve the object scale and position problem in
multi-label images using a deep CNN. They make use of AlexNet and
add two extra convolutional layers on top of its fc7 layer, and
propose a sliding window mechanism using this improved AlexNet
model. However, the sliding window mechanism is inefficient
compared with the recent fully convolutional neural network (FCN)
[14].

The FCN model can be obtained by turning the fully connected
layers in the original classification CNN into convolutional
layers. This model can make use of the original classification
network's weights as weight initialization, but outputs spatial
score heat maps for all the classes. This idea was applied to
semantic segmentation to produce pixel-wise semantic labels
[14, 9]. However, these methods use only a single input scale and
obtain one set of output score maps, then combine the up-scaled
output score maps with the previous pooling layers to improve the
segmentation result. We utilize FCNs in our video description
framework to obtain concept scores at several different input
scales of the video frame.

MIL is a well-known weakly supervised learning method that has
been recently applied to object classification using deep CNNs.
One advantage of deep MIL is that it learns a representation as
well as a classifier. To the best of our knowledge, it has not
been previously attempted for video understanding tasks. In [18],
the authors introduce the concept of multiple instance learning
with FCN to make use of multi-class image labels for training, but
their ultimate goal is still image segmentation. We use MIL within
each input scale's FCN to deal with the uncertainty about the
object positions within each scale, and use MIL between the FCNs
of different input scales to deal with the uncertainty about
object scales. Our goal is to capture a richer representation of
the input frame that can eventually help in generating better
visual features for tasks such as video captioning. We note that
our model can also be directly applied to other applications such
as multi-label image classification.


3. Approach

We present the Multi-scale Multiple Instance Video Description
Network (MM-VDN), an end-to-end deep neural network that accepts
short videos and produces human-like sentence descriptions. We
first give an overview of the approach, then describe the
components in more detail.

3.1. Overview

An overview of the architecture is shown in Figure 2. The network
first reads in and processes frames, then generates a sequence of
words. The first part of the network is a visual representation
extraction component that processes each input frame to produce an
$N$-dimensional hidden semantic state $v_t$, with elements
corresponding to scores for the high-level concepts present in the
frame. The second part of the network is a recurrent component
that takes the sequence of vectors $v_1, \ldots, v_V$ as input and
outputs a sequence of words $w_1, \ldots, w_W$.

Specifically, for the first $t = 1, \ldots, V$ time steps, the
network receives a frame image $I_t$, and applies the visual
subnet to produce the hidden state $v_t = f(I_t; \theta_f)$ using
a series of nested convolutional operations whose joint set of
parameters is represented by $\theta_f$:

$$ f = f_L \circ f_{L-1} \circ \cdots \circ f_1 \qquad (1) $$

Each layer $f_l$ is defined by its type: a matrix multiplication
for convolution or max pooling, an element-wise ReLU non-linearity
for an activation function, a normalization layer, etc. Note that,
unlike AlexNet, our visual network is fully convolutional at all
layers.

The visual hidden state is then passed to the recurrent subnet,
which produces a hidden state $z_t$ encoding each additional
frame's visual information:

$$ z_t = g(v_t, z_{t-1}; \theta_g) \qquad (2) $$

where $\theta_g$ represents the parameters of all layers in the
recurrent subnet, and $g$ is typically a matrix multiplication
followed by an element-wise non-linearity. Once the frames have
been encoded, for the next time steps $t = V+1, \ldots, V+W$, the
model starts to decode. Each hidden state $z_{t-1}$ of the
previous time step is used to obtain the emitted word $w_t$ via a
softmax function:

$$ p(w_t \mid z_{t-1}) = \frac{\exp(\theta_{w_t} z_{t-1})}{\sum_{w' \in D} \exp(\theta_{w'} z_{t-1})} \qquad (3) $$

where $D$ is the word vocabulary. The model learns by maximizing
the probability of the correct word sequence given the input
frames:

$$ p(w_1, \ldots, w_W \mid I_1, \ldots, I_V) \qquad (4) $$

More details on how this probability is modeled in our framework
are given below.

We first present the details of the visual subnet, which is the
main contribution of this paper, followed by a brief overview of
the sentence generation recurrent subnet. For full details of the
sentence generation recurrent subnet we refer the reader to [23].
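
The softmax word distribution used in the overview can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-word vocabulary, the two-dimensional hidden state, and the weight vectors in `theta` are made up for the example, and the function name `word_distribution` is hypothetical rather than part of the described system.

```python
import math


def word_distribution(theta, z_prev):
    """Softmax over the vocabulary given the previous hidden state.

    theta:  dict mapping each word to its weight vector (illustrative values).
    z_prev: previous hidden state as a list of floats.
    """
    # Dot product between each word's weights and the hidden state.
    logits = {w: sum(a * b for a, b in zip(vec, z_prev)) for w, vec in theta.items()}
    # Subtract the largest logit before exponentiating for numerical stability.
    top = max(logits.values())
    exps = {w: math.exp(value - top) for w, value in logits.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Toy vocabulary and hidden state (assumptions for illustration only).
theta = {"cat": [1.0, 0.0], "bear": [0.0, 1.0], "woods": [0.5, 0.5]}
probs = word_distribution(theta, [2.0, 0.0])
best_word = max(probs, key=probs.get)
```

Subtracting the maximum logit does not change the resulting probabilities; it only avoids overflow for large scores.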

3.2. Multi-scale Region Processing using an FCN 

The visual subnet (Figure 2) is structured as several multi-scale
fully convolutional networks connected via the MIL mechanism. We
use a base CNN (here AlexNet) that consists of five convolutional
and pooling layers followed by three fully-connected layers. The
first scale is the base pre-trained CNN classification network
applied on the whole frame to capture scene-level semantics.
Additional scales consist of the same CNN network but applied in a
fully convolutional manner across upsampled versions of the
original frame. The MIL mechanism consists of several layers of
max-pooling and allows the latent position and scale of concepts
to be discovered simultaneously during learning.

Figure 2. We propose a new end-to-end deep neural network
architecture for describing video in natural language. The input
to the network is a sequence of video frames, and the output is a
sequence of words (a phrase or sentence). Each frame is processed
by a multi-scale multi-instance convolutional network and embedded
into an $N$-dimensional high-level semantic vector, corresponding
to activations of $N$ high-level concepts. A recurrent network
accepts such semantic vectors from all frames in the video, and
then decodes the resulting state into the output sentence. Unlike
previous approaches that used a single-scale single-label
architecture (top stream in our network), our network can handle
ambiguity in the number, size and location of objects and
activities in the scene. Shared parameters transferred from
AlexNet are shaded. See text for more details.

The goal of the visual subnet is to take an input frame $I_t$ and
produce a set of probabilities $v_t$ for each high-level concept
present in the frame. Typically, this is accomplished by a CNN
trained on image classification, but a classification CNN works
best for concepts occupying most of the frame. We therefore cast
the CNN as a fully convolutional network to process arbitrary
region sizes.

We start by taking the AlexNet model [11] pre-trained on ImageNet.
AlexNet is a well-studied CNN architecture, which consists of five
convolutional/pooling layers (conv1-5) and three fully-connected
layers (fc6-8). The input image size of AlexNet is designed to be
227 x 227; in practice, however, the input image is resized to
256 x 256, then cropped and mirrored to ten patches of size
227 x 227. Layers fc6 and fc7 have $K$ neurons each; in the case
of AlexNet, $K = 4096$. Each neuron in the last fc8 layer is
trained to predict the probability of a single concept, with a
total of $N$ possible concepts. Standard pre-training on ImageNet
ILSVRC-1K classification data yields $N = 1000$; however, we
experiment with other cardinalities of concept spaces.

Our FCN conversion of AlexNet changes the last three
fully-connected layers into convolutional layers, while the first
five convolutional layers are kept the same. The weights in the
last three fully-connected layers of AlexNet are converted to be
the filter weights in the last three convolutional layers of the
FCN. For example, the output of conv5 in AlexNet is 256 x 6 x 6,
which is first flattened into a vector of size 9216, and then
connected to fc6. The weight matrix between these two layers is
thus of size 4096 x 9216 in AlexNet. In the FCN, the output of
conv5 is no longer flattened; the filter weights between conv5 and
conv-fc6 are of size 4096 x 256 x 6 x 6, which can be obtained by
reshaping the 4096 x 9216 fc6 weights. Because of this direct
conversion relationship, we can initialize the weights of the FCN
directly from pre-trained AlexNet weights, and then further
fine-tune the concepts on our specific task. An FCN can accept an
arbitrary input image size, so by upsampling the input image, we
can get larger output score maps.
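
The reshaping argument can be checked on a toy example. The sketch below is not the authors' code: it uses deliberately tiny, made-up dimensions (2 channels of 2 x 2 instead of 256 channels of 6 x 6, and 3 output units instead of 4096), assumes a channel-major flattening order, and the helper names are hypothetical. It shows that a fully-connected layer and the convolution built from its reshaped weights agree on an input of the original size, and that the same filters yield a spatial score map on a larger input.

```python
def fc_forward(weights, feature_map):
    """Fully-connected layer: flatten (channel, row, column) and multiply."""
    flat = [v for channel in feature_map for row in channel for v in row]
    return [sum(w * x for w, x in zip(unit, flat)) for unit in weights]


def reshape_fc_to_conv(weights, channels, height, width):
    """Reshape each fc weight row into a channels x height x width filter."""
    filters = []
    for unit in weights:
        values = iter(unit)
        filters.append([[[next(values) for _ in range(width)]
                         for _ in range(height)]
                        for _ in range(channels)])
    return filters


def conv_forward(filters, feature_map):
    """Valid (no padding, stride 1) convolution; one score map per filter."""
    in_h, in_w = len(feature_map[0]), len(feature_map[0][0])
    k_h, k_w = len(filters[0][0]), len(filters[0][0][0])
    output = []
    for f in filters:
        score_map = []
        for y in range(in_h - k_h + 1):
            row = []
            for x in range(in_w - k_w + 1):
                total = 0
                for c, channel in enumerate(feature_map):
                    for dy in range(k_h):
                        for dx in range(k_w):
                            total += f[c][dy][dx] * channel[y + dy][x + dx]
                row.append(total)
            score_map.append(row)
        output.append(score_map)
    return output


# Toy fc layer: 3 output units over a flattened 2 x 2 x 2 input.
weights = [[1, 0, 0, 0, 0, 0, 0, 1],
           [0, 1, 0, 0, 0, 0, 1, 0],
           [1, 1, 1, 1, 1, 1, 1, 1]]
patch = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

fc_out = fc_forward(weights, patch)
filters = reshape_fc_to_conv(weights, channels=2, height=2, width=2)
conv_out = conv_forward(filters, patch)        # one 1 x 1 map per unit

# A larger input now produces a 2 x 2 score map per unit.
bigger = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
          [[9, 8, 7], [6, 5, 4], [3, 2, 1]]]
big_out = conv_forward(filters, bigger)
```

On the original-size input the convolution collapses to a single position, so its output matches the fully-connected layer exactly; the spatial map only appears once the input grows.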

The weights of the first seven layers are shared across scales
(shown by shading their outputs in Figure 2). The fc8 weights are
not shared, to allow different concepts to be learned at different
scales (e.g., scenes vs. objects). We experiment with learning the
fc8 weights, either starting from initial pre-trained AlexNet
weights, or starting with zero-initialized weights and learning
concepts from scratch. A potential advantage of learning new
concept weights is that the original pre-trained concepts may not
align with those in the video corpus. For example, concepts
pre-trained on the ImageNet ILSVRC-1K challenge do not include the
concept of the person object, while this happens to be the most
frequent object mentioned in our YouTube corpus.

In addition to the original whole-image AlexNet, we create several
FCNs using the above conversion, one for each scale. For the input
frame $I_t$ and label set $\mathcal{L}$, each FCN produces an
output score map $p_c(x, y)$ for the label $c$ at location
$(x, y)$.

3.3. Multiple Instance Learning over Locations 

Performing supervised learning of semantic concepts in the video
frame would require labeled data in the form of frames and object
labels at each location and scale. Since such data is very
difficult to obtain, we resort to a multiple instance learning
approach. In contrast with supervised learning, MIL allows weaker
forms of supervision where training examples come in bags.
Negative bags typically contain exclusively negative instances,
while positive bags are only known to contain at least one
positive instance. Thus, true positive labels are latent and must
be discovered during learning.

In our case, we consider a positive bag to be all image patches
corresponding to all possible receptive field locations at all
scales in a given frame. If our task were concept detection, the
positive bag label would be the presence/absence for each of $N$
possible concepts. We can treat the words in the sentence as
concepts to be predicted. However, unlike in traditional MIL where
the bag label is unambiguous, here there is additional ambiguity
as we only have access to a sequence of words for the entire
sequence of frames. We must therefore solve the alignment problem
of assigning words to each frame. In our approach, this is
implicitly handled by the recurrent LSTM network, described below.
For now, we assume the bag label is itself latent and must be
inferred during network training.

We define an MIL max pooling layer on top of each FCN output score
map to capture the maximum score $p_c(x_c, y_c)$ for each label
$c$ in the label set $\mathcal{L}$, which infers the latent object
positions. The position $(x_c, y_c)$ of the maximum score for
class $c$ is obtained by a max-pooling operation:

$$ (x_c, y_c) = \arg\max_{\forall (x, y)} p_c(x, y) \quad \forall c \in \mathcal{L} \qquad (5) $$
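
As a minimal sketch of this max pooling over locations, the plain-Python snippet below takes one score map per label and returns the best score together with its position. The function name, the dictionary layout, and the toy 2 x 2 score maps are assumptions made for illustration; in the actual network this step is a max-pooling layer whose gradient flows back only through the selected location.

```python
def mil_max_over_locations(score_maps):
    """For each label, return (max score, (x, y)) over its score map.

    score_maps: dict mapping a label to a 2-D list indexed as grid[y][x].
    """
    pooled = {}
    for label, grid in score_maps.items():
        candidates = ((grid[y][x], (x, y))
                      for y in range(len(grid))
                      for x in range(len(grid[0])))
        # Keep the highest-scoring location; ties resolve to the first found.
        pooled[label] = max(candidates, key=lambda item: item[0])
    return pooled


# Toy score maps for two labels (illustrative values only).
score_maps = {
    "cat": [[0.1, 0.2],
            [0.9, 0.3]],
    "bear": [[0.4, 0.8],
             [0.1, 0.2]],
}
pooled = mil_max_over_locations(score_maps)
```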

3.4. Multiple Instance Learning over Scales 

As mentioned above, the input image size of AlexNet is 227 x 227,
and the output is a score vector of size $N$, the number of
concepts. After the FCN conversion, if the input image size for
the FCN remains unchanged (227 x 227), the output will be a
1 x 1 x $N$ score map. When the input image size of the FCN
increases, the size of the output score map also increases to
$h \times h \times N$, where $h$ is the side length of the output
score map. However, for a fixed FCN structure, each score in the
output score map corresponds to a fixed-size receptive field in
the input image (355 x 355 [14]). The receptive field size in the
FCN is determined by the FCN structure, and is independent of the
input image size. For small input image sizes, the receptive field
area (355 x 355) may include some padding beyond the margin of the
input image.

Input Image Size   Score Map Size   Height Ratio
227 x 227          1 x 1            100%
259 x 259          2 x 2            100%
323 x 323          4 x 4            100%
451 x 451          8 x 8            78.7%
707 x 707          16 x 16          50.2%

Table 1. Height ratio of the receptive field to the original input
image for different input sizes in the FCN. The size of the
receptive field is 355 x 355 for all input image sizes.

When the input image of the FCN has been upscaled, each score in
the output score map corresponds to a smaller region in the
original unscaled image. Thus, by using an FCN coupled with an
upscaled input image, we can capture smaller objects. We further
combine several FCNs with different input image sizes (scales),
and apply MIL across the scales to capture concepts of different
scales simultaneously. The ratio of the receptive field height to
the original image height for several different input scales is
shown in Table 1. Note that, for several input scales in Table 1,
the input image size is smaller than the receptive field size
(355 x 355). This occurs because the margin of several layers'
output (conv2/conv3/conv4/conv5) in the FCN is padded to generate
the final score maps.
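
The entries of Table 1 can be reproduced with a short calculation. The sketch below rests on two assumptions that are not stated in the text: that the converted network has an overall stride of 32 pixels (inferred here from the input sizes and score map sizes listed in the table), and that the height ratio is capped at 100% when the receptive field is larger than the image. The constant and function names are hypothetical.

```python
RECEPTIVE_FIELD = 355   # receptive field side in pixels, as given in Table 1
BASE_INPUT = 227        # input side that yields a 1 x 1 score map
STRIDE = 32             # assumed overall stride, inferred from Table 1


def score_map_side(input_side):
    """Side length h of the h x h output score map for a square input."""
    return (input_side - BASE_INPUT) // STRIDE + 1


def height_ratio_percent(input_side):
    """Receptive field height relative to the input height, capped at 100%."""
    return min(100.0, round(100.0 * RECEPTIVE_FIELD / input_side, 1))


table = {side: (score_map_side(side), height_ratio_percent(side))
         for side in (227, 259, 323, 451, 707)}

for side, (h, ratio) in table.items():
    print(f"{side} x {side}: score map {h} x {h}, height ratio {ratio}%")
```

Under these assumptions, each doubling of the score map side beyond 1 x 1 corresponds to adding a multiple of 32 pixels to the input side, which matches the progression 227, 259, 323, 451, 707.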

We define an additional MIL element-wise max layer on top of the
multi-scale FCNs to select among different input scales $s$ for
each concept $c$ in the label set $\mathcal{L}$. The output of
this layer is the final semantic concept vector for the $t$-th
frame:

$$ v_t = \max_{\forall (s)} p_c(x_c, y_c, s) \quad \forall c \in \mathcal{L} \qquad (6) $$

The loss on this layer (as on all others) is propagated back from
the word output layer. Next, we describe how this visual
representation is aggregated across frames and translated to the
output description sentence.
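
The element-wise max across scales can be sketched as follows. The snippet assumes that each scale has already been reduced to one score per concept (for the FCN scales, by the max over locations), and it uses made-up scores for three concepts and two scales; the function name is hypothetical.

```python
def mil_max_over_scales(per_scale_scores):
    """Element-wise max over scales, plus the index of the winning scale.

    per_scale_scores: list of concept-score vectors, one per scale,
    all of the same length N.
    """
    concept_vector = []
    winning_scale = []
    for scores_for_concept in zip(*per_scale_scores):
        best = max(scores_for_concept)
        concept_vector.append(best)
        winning_scale.append(scores_for_concept.index(best))
    return concept_vector, winning_scale


# Illustrative scores for N = 3 concepts at two scales.
whole_frame = [0.9, 0.1, 0.2]   # e.g. whole-frame AlexNet fc8 scores
fcn_scale = [0.3, 0.7, 0.1]     # e.g. max-pooled conv-fc8 scores of one FCN
v_t, chosen = mil_max_over_scales([whole_frame, fcn_scale])
```

Keeping track of the winning scale is not required by the model; it is included only because it makes the selection easy to inspect.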

3.5. Recurrent Subnet for Word Generation 

We use a two-layer encoder-decoder LSTM model proposed in [23] to
generate descriptions of video. The encoder part encodes the
visual input for each frame, and the decoder part accepts the
hidden state from the encoder and outputs words in sequence.
Special symbols are used to indicate the beginning and end of the
sentence. The encoding-decoding paradigm is used to estimate the
conditional probability of an output sequence
$(w_1, \ldots, w_W)$ given an input sequence of visual state
vectors $(v_1, \ldots, v_V)$, see Eq. 4.

In the encoding phase, a part of this conditional probability is
computed by generating a fixed-length representation $z$ based on
the entire sequence of inputs $(v_1, \ldots, v_V)$. The decoding
step then computes the probabilities of the output sequence of
words $(w_1, \ldots, w_W)$ as:

$$ p(w_1, \ldots, w_W \mid v_1, \ldots, v_V) = \prod_{t=1}^{W} p(w_t \mid z, w_1, \ldots, w_{t-1}) \qquad (7) $$

where the distribution $p(w_t \mid z, w_1, \ldots, w_{t-1})$ is
given by a softmax over all of the words in the vocabulary $D$.
An overview of the architecture is shown in Figure 3; see [23] for
more details.

Figure 3. Structure of the LSTM-based recurrent network used in
our approach (figure from [23]). Two hidden layers with LSTM cells
are used; the first layer encodes the visual input for each frame,
and the second layer accepts the hidden state from the first layer
and outputs words. Special symbols are used to indicate the
beginning and end of the sentence.
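
The factorization of the sentence probability into per-word terms can be illustrated with a small calculation. The sketch below assumes that the per-step softmax distributions have already been computed by a decoder; the two toy distributions and the function name are made up for the example. Working with log-probabilities, as is common in practice, turns the product into a sum.

```python
import math


def sentence_log_prob(step_distributions, words):
    """Sum of per-step log-probabilities of the reference words.

    step_distributions: list of dicts, one per decoding step, mapping each
    vocabulary word to its probability at that step (assumed precomputed).
    words: the reference word sequence, same length as step_distributions.
    """
    return sum(math.log(dist[word])
               for dist, word in zip(step_distributions, words))


# Toy two-step decoder output (illustrative probabilities only).
distributions = [
    {"a": 0.5, "the": 0.5},
    {"cat": 0.25, "bear": 0.75},
]
log_prob = sentence_log_prob(distributions, ["a", "cat"])
sentence_prob = math.exp(log_prob)   # product of 0.5 and 0.25
```

Maximizing this log-probability over the training sentences is equivalent to maximizing the product form of the sentence probability.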

4. Experiments 

In this section, we evaluate several variants of our approach and
compare it to related work. We use the BLEU [17] and METEOR [1]
scores to evaluate the generated sentences against all reference
sentences. BLEU is the most commonly used metric in the image
description literature, but METEOR has also been shown to be a
good evaluation metric in a recent study [3].

4.1. Dataset and Preprocessing 

We perform our experiments on the Microsoft Research Video
Description Corpus (MSVD) [2]. The dataset contains 1,970 short
YouTube video clips paired with multiple human-generated
natural-language descriptions (about 40 English sentence
descriptions per video). The video clips are 10 to 25 seconds in
duration and typically consist of a single activity. The 1,970
videos are split into a training set (1,200 videos), a validation
set (100 videos) and a testing set (670 videos), as used by prior
work on the same video description task [6, 22, 26, 24]. We
perform model selection on the validation set based on the BLEU
score.

To obtain a CNN model that is targeted to our video description
dataset, we perform transfer learning by pre-training the
classification CNN on ImageNet. For this purpose, we compose a
subset of ImageNet consisting of the 566 categories (synsets)
which are present in the MSVD dataset. To produce the list of
synsets, we manually matched each subject or object that appears
in the sentences to the closest synset available in ImageNet. This
reduced the more than 900 initial nouns present in MSVD to 566
categories. We compose the dataset from ImageNet single-label
images and fine-tune the BVLC reference model (AlexNet) [8] on
these categories (all layers are fine-tuned). We use this model as
initialization of the AlexNet model in our experiments. We
transfer the 566-category AlexNet fc6/fc7/fc8 weights to the FCN
conv-fc6/conv-fc7/conv-fc8 layers by reshaping, and initialize the
FCN conv1-5 layer weights directly from the 566-category AlexNet.

We also experiment with using the BVLC published 1000-category
AlexNet weights as the weight initialization for our model. We
find that the experimental results, measured by BLEU and METEOR,
are almost the same as when using the 566-category MSVD fine-tuned
model as weight initialization. In the following experiments, we
use the 566-category fine-tuned model weights as the weight
initialization of our model.

We also tried several fine-tuning mechanisms with our integrated
model, e.g. first fine-tuning each single-scale FCN to get the
best set of parameters, and then doing a global fine-tuning pass
on all the parameters that need to be fine-tuned. Experimental
results did not show any obvious improvements in our setting, so
we choose to directly fine-tune all the parameters together.

4.2. Relationship to Related Work 

Several works [6, 22, 26] that explored template-based sentence
generation methods on the same MSVD video dataset only reported
SVO accuracy and did not provide a BLEU/METEOR value for generated
sentences. Recent papers using the CNN and LSTM combination
architecture to directly generate sentences for images or videos
usually report the BLEU/METEOR scores. [24] uses mean pooling of
CNN fc7 features as the visual input at each time step of a
two-layer LSTM to generate descriptions for videos. We use their
reported LSTM-YT model result (without fine-tuning on the
still-image description datasets COCO or FLICKR) as one baseline.
We also compare to BLEU and METEOR scores for a template model
called FGM proposed in [22], which uses a factor graph to improve
the SVO prediction.

Since our main contribution is to incorporate region processing at
multiple scales, we compare our model to others that use the same
underlying AlexNet model, but at the standard whole-image scale.
Two papers currently in review ([23] and [27]) showed that higher
overall performance can be obtained by replacing AlexNet with the
more powerful GoogLeNet or VGG CNNs. [23] also incorporate
additional optical flow features, while [27] add 3-D convnet
motion features trained on extensive activity corpora. To enable
fair comparison, we omit models that use better underlying
features or add motion features.

[23] extends the two-layer LSTM model used in [24] to
encoder-decoder mode by padding the video frame sequence and word
sequence. We integrate our multi-scale FCN and MIL model with this
two-layer encoder-decoder LSTM model, and show the effectiveness
of our multi-scale FCN and MIL mechanism.

4.3. Ablation Study of Different Scales 

In our model, the input image for the AlexNet part is resized to
256 x 256 and cropped to 227 x 227 to generate five candidate
patches (four corners and one center) without mirroring. The input
size for other scales is directly set to be the input size listed
in Table 1, without cropping or mirroring. The AlexNet weights are
initialized with the 566-category ImageNet model, and conv1 to fc7
weights are kept fixed during training. The FCN weights are
initialized with the reshaped 566-category ImageNet model weights,
and conv1 to conv-fc7 weights are also kept fixed. The fc8 and
conv-fc8 layers directly connect with the LSTM recurrent neural
network via max operations, and only fc8, conv-fc8 and the LSTM
parameters are fine-tuned on the training videos. We have also
tried to fine-tune the weights in conv-fc6 and conv-fc7, but the
result was poor compared with keeping these layers fixed.
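
The five-patch cropping of the whole-frame input can be sketched as a small offset computation. This is an illustration only: the text does not specify how the center crop is rounded, so the integer division used for the center offset is an assumption, and the function name and dictionary keys are hypothetical.

```python
def five_crop_offsets(frame_side=256, crop_side=227):
    """Top-left (x, y) offsets of four corner crops and one center crop."""
    margin = frame_side - crop_side   # 29 pixels for 256 -> 227
    center = margin // 2              # assumed rounding for the center crop
    return {
        "top_left": (0, 0),
        "top_right": (margin, 0),
        "bottom_left": (0, margin),
        "bottom_right": (margin, margin),
        "center": (center, center),
    }


offsets = five_crop_offsets()
# Each crop covers pixels [x, x + 227) horizontally and [y, y + 227) vertically.
boxes = {name: (x, y, x + 227, y + 227) for name, (x, y) in offsets.items()}
```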

In this set of experiments, we investigate the effect of using
different single input scales in the FCN model, and several
combinations of different scales. The BLEU and METEOR scores are
shown in Table 2. For a single scale, the original AlexNet
whole-frame scale is actually better than the other two scales
alone. However, the combination of the whole-frame scale and a
scale of 451 x 451 gains about 4.7 points in BLEU and about 1
point in METEOR. If we further add the input scale of 707 x 707,
performance is degraded. This indicates that the optimal scales
are dataset-specific; additional scales could be needed to achieve
good performance on other datasets. Our model is flexible enough
to integrate different input scales and combinations of several
scales, depending on the dataset.

Scale (Score Map Size)        BLEU     METEOR
AlexNet                       32.96    28.04
AlexNet*                      32.80    28.09
8 x 8                         31.05    27.44
16 x 16                       27.92    25.90
AlexNet* + 8 x 8              37.64    29.00
AlexNet* + 16 x 16            26.59    25.2
AlexNet* + 8 x 8 + 16 x 16    31.94    26.99

Table 2. BLEU and METEOR results for the ablation study. AlexNet*
represents the fine-tuned AlexNet model.

Method                        BLEU     METEOR
FGM [22]                      13.68    23.9
LSTM-YT [24]                  31.19    26.87
MM-VDN (AlexNet* + 8 x 8)     37.64    29.00

Table 3. Comparison to other baselines.

4.4. Comparison to Published Results 

There are several variations of BLEU and METEOR 
calculation methods. We use the same BLEU and METEOR 
calculation script as was used by the authors of [24] to 
produce results for all baselines and our method. The 
comparison of the best MM-VDN model from the ablation study 
(AlexNet* + 8 x 8 score maps) to the other two baselines 
is listed in Table 3. Our MM-VDN model provides a 
distinct improvement in both BLEU and METEOR compared 
to the other two baselines. We note that even better results 
could likely be obtained by our model using pre-training on 
still image data [24] or by replacing AlexNet with deeper 
convnets such as [20, 21]. We leave this for future work. 
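
Because BLEU exists in several variants, it may help to recall the core computation they share: clipped n-gram precision combined with a brevity penalty. The sketch below is a simplified sentence-level version with a single reference, no smoothing, and n-grams up to length 2 (standard BLEU is usually computed at corpus level with n-grams up to length 4). It is not the evaluation script of [24]; the function names and the toy sentences are illustrative assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams of length n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision gives a score of 0
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


# Toy sentences (not taken from the evaluation data).
reference = "a man is slicing a carrot"
print(simple_bleu("a man is slicing a carrot", reference))
print(simple_bleu("a man is slicing a tomato", reference))
```

Variants differ in choices such as the maximum n-gram length, smoothing, tokenization and how multiple references are handled, which is why using one script for all methods matters for a fair comparison.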

Some examples of the generated sentences are shown in 
Table 4. In several cases, such as rows 1, 3, 4 and 5 of 
Table 4, our model correctly identifies smaller objects that 
are missed by the whole-frame AlexNet model, such as the 
carrot, guitar, piano and skateboard. 

4.5. Visualizations 

In this section, we show some visualizations of the 
multiscale features learned by our MM-VDN model. In 
Table 5, the first column shows one sampled frame 
from each video. The second column shows one 8 x 8 
conv-fc8 heat map for this sampled frame, corresponding 
to the channel with the highest activation value in the FCN 
with input size 451 x 451. The third column shows the 
values of the whole-frame AlexNet fc8 feature for this 
sampled frame. The fourth column shows the values of the 
max-pooled conv-fc8 feature in the FCN with input size 
451 x 451 for this sampled frame. The last column shows the 
feature values of the combination of the two scales (max of 
AlexNet fc8 and FCN451 conv-fc8). 
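
The quantities visualized in Table 5 can be sketched in a few lines. This is a minimal illustration, assuming the conv-fc8 output is a stack of 8 x 8 score maps (one per channel) and the whole-frame fc8 output is a vector with the same number of channels. The variable names, the random stand-in values and the channel count are hypothetical, and "channel with the highest activation" is read here as the channel whose spatial maximum is largest.

```python
import random

rng = random.Random(0)
NUM_CHANNELS = 5   # hypothetical; stands in for the fc8 output dimension
MAP_SIDE = 8       # 8 x 8 score map for the 451 x 451 input

# Stand-ins for real network outputs on one sampled frame.
fc8_whole = [rng.gauss(0.0, 1.0) for _ in range(NUM_CHANNELS)]
conv_fc8 = [
    [[rng.gauss(0.0, 1.0) for _ in range(MAP_SIDE)] for _ in range(MAP_SIDE)]
    for _ in range(NUM_CHANNELS)
]

# Max pooling over all spatial positions: one value per channel.
pooled = [max(max(row) for row in channel) for channel in conv_fc8]

# Heat map column: the channel whose pooled activation is highest.
best_channel = max(range(NUM_CHANNELS), key=lambda c: pooled[c])
heat_map = conv_fc8[best_channel]

# Last column: element-wise max of the whole-frame and pooled features.
combined = [max(a, b) for a, b in zip(fc8_whole, pooled)]

print("best channel:", best_channel)
print("combined feature:", [round(v, 3) for v in combined])
```

Taking the element-wise maximum means a concept is kept if either scale responds strongly to it, which is one way the two scales can complement each other.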

The sampled frames are shown in their original size. The 
heat maps clearly show that the model is able to localize 
smaller regions and assign them to a high-level semantic 
feature. The histograms indicate that the highest-scoring 
semantic conv-fc8 feature is not always the same as the 
highest-scoring whole-frame fc8 feature, and that the 
feature values of the two scales can be complementary to each 
other. 

5. Conclusion 

This paper proposed a Multi-scale Multi-instance Video 
Description Network (MM-VDN), which combines a 
convolutional and a recurrent part to simultaneously learn to 
extract useful high-level concepts and generate language. The 
MM-VDN model integrates the Fully Convolutional Network 
(FCN) conversion of a classification CNN to potentially 
capture medium and small scale concepts in the video 
frame. It also incorporates a multi-instance learning 
mechanism to deal with the uncertainty about the number, 
position and scale of concepts in the video frame. The model 
is shown to be effective on the task of video description 
generation, compared to the single-scale whole-frame 
classification CNN. 

The MM-VDN model is efficient and extensible. Its 
efficiency makes it especially suitable for processing video, 
where region proposal generation mechanisms would be 
prohibitively slow. It can integrate several FCNs, each with 
a different input scale. It can also integrate the FCN 
conversion of several recent advanced CNN models (e.g. VGG 
and GoogLeNet). The MM-VDN model can also be applied to 
other tasks, beyond the video description task. For example, 
replacing the recurrent network with a multi-label loss layer 
would allow multi-scale multi-label image classification. 
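
As a rough illustration of the multi-label variant mentioned above, the sketch below applies a per-class sigmoid cross-entropy loss to one pooled score per class. The choice of sigmoid cross-entropy, the function names and the toy numbers are assumptions made for illustration; the text does not specify which multi-label loss would be used.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def multilabel_loss(scores, labels):
    """Mean sigmoid cross-entropy over classes.

    scores: one pooled score per class (e.g. max over positions and scales).
    labels: 1 if the class is present in the image, else 0.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(s)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(scores)


# Toy example: three classes, two of them present in the image.
pooled_scores = [2.5, -1.0, 0.3]
labels = [1, 0, 1]
print(multilabel_loss(pooled_scores, labels))
```

Because each class is scored independently, several classes can be active for the same image, unlike a softmax loss that assumes exactly one label.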

Future work includes investigating the FCN conversion 
of the recent advanced CNN models in the MM-VDN 
framework, and adding motion features. We are also 
interested in investigating the LSTM bidirectional 
optimization and applying more complex language models to 
improve the LSTM language generation model. 
MM-VDN predicts correct sentences 




MM-VDN predicts partially correct sentences 


GT: A man is cutting carrots. 

FGM: A person is cutting a chicken . 
LSTM-YT: A woman is peeling a potato. 
AlexNet: A man is slicing a tomato. 
MM-VDN: A man is slicing a carrot. 

GT: A child is riding a horse. 

FGM: A person is playing with a person. 
LSTM-YT: A man is running. 

AlexNet: A man is walking on the ground. 
MM-VDN: A man is riding a horse. 

GT: A boy is playing a guitar. 

FGM: A person is playing with a person. 
LSTM-YT: A women of dancing. 

AlexNet: A man is playing a horse. 
MM-VDN: A man is playing a guitar. 

GT: A boy is playing a piano. 

FGM: A person is playing the guitar. 
LSTM-YT: A man is playing the boxing. 
AlexNet: A man is playing the guitar. 
MM-VDN: A man is playing a piano. 
GT: A man is skateboarding. 

FGM: A person is playing with a person. 
LSTM-YT: A cat is running on a toy. 
AlexNet: A man is pushing a wall. 
MM-VDN: A man is doing a skateboard. 





MM-VDN makes errors 


GT: A cat is eating a small wedge of watermelon 
FGM: A person is cutting the water. 

LSTM-YT: A woman is eating an egg. 

AlexNet: A baby is drinking. 

MM-VDN: A cat is eating. 

GT: A doctor gives a shot to a baby 
FGM: A person is cutting a person. 

LSTM-YT: A man is putting a piece of hands. 
AlexNet: A man is putting a stick. 

MM-VDN: A man is playing with a baby. 



GT: A person is peeling a banana from the bottom. 
FGM: A person is cutting an onion. 

LSTM-YT: A woman is doing a card. 

AlexNet: A woman is peeling a apple. 

MM-VDN: A woman is cutting a potato. 


GT: A turtle is walking. 

FGM: A person is walking in the food. 
LSTM-YT: A baby is eating. 

AlexNet: A panda is eating. 

MM-VDN: A panda is walking. 

Table 4. Some example videos and predicted sentences. (GT) shows the ground truth sentences; (FGM) is the factor graph model in [22]; 
(LSTM-YT) is the model in [24]; (AlexNet) is the basic CNN + LSTM model; (MM-VDN) is our model. The top section shows videos 
where our MM-VDN model improves on others; the middle section shows videos where our model predicts part of the sentence correctly; 
the bottom shows videos where our model makes errors. 
Table 5. Visualization of the heat maps and the histograms of multiscale feature value for sampled frames. 



GT: A man is cutting a carrot. MM-VDN: A man is slicing a carrot. 

GT: A man is tying in the machine. MM-VDN: A woman is typing. 

GT: A woman is mixing eggs. MM-VDN: A woman is mixing an egg. 

GT: A woman is cutting parsley. MM-VDN: A woman is cutting a vegetable. 

GT: A boy is playing a guitar. MM-VDN: A man is playing a guitar. 

GT: A man is playing a guitar with his feet. MM-VDN: A man is playing a guitar. 

GT: A cat is playing in a box. MM-VDN: A cat is playing. 

GT: A person is peeling a potato. MM-VDN: A man is peeling a potato. 

GT: A man is shooting at a target. MM-VDN: A man is shooting a target. 

GT: A person is slicing an onion. MM-VDN: A man is slicing an onion. 

GT: A man is riding a motorcycle. MM-VDN: A man is riding a motorcycle. 

GT: A woman is cutting green onion. MM-VDN: A woman is cutting a green onion. 

GT: A girl is slicing a potato into pieces. MM-VDN: A man is peeling a potato. 

GT: A girl is riding a horse. MM-VDN: A girl is riding a horse. 

GT: A man is playing a guitar. MM-VDN: A man is playing a guitar. 

GT: People are dancing outside. MM-VDN: People are dancing. 

GT: A man is playing a guitar. MM-VDN: 